Credit Card Users Churn Prediction

Description

Background & Context

Thera Bank recently saw a steep decline in the number of its credit card users. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving their credit card services would cause the bank a loss, so the bank wants to analyze its customer data, identify which customers are likely to leave, and understand the reasons why, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to build a classification model that will help the bank improve its services so that customers do not give up their credit cards.

You need to identify the best possible model that will deliver the required performance.

Objective

  1. Explore and visualize the dataset.
  2. Build a classification model to predict whether a customer is going to churn.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary:

Solution Approach

Solution

Understand Given Data

Read the given data into a data frame and understand its nature: the available features, the total number of records, and whether the data has any missing values, duplicates, or outliers.

Visualize the data to understand its range and outliers.

Loading necessary libraries for EDA

Load all standard Python library packages.
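A minimal sketch of the import cell, assuming the standard pandas/NumPy/Matplotlib stack (seaborn is typically added for histplot and boxplot, but is omitted here):

```python
# Core EDA stack: pandas/NumPy for data manipulation, Matplotlib for plotting.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped in a notebook
import matplotlib.pyplot as plt
```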

Data Manipulation

Data Visualization

Load data to dataframe and check data & data type

Read the given CSV file BankChurners.csv and load it into the data frame data.

View the first and last 5 rows of the dataset.

The dataset has a few categorical features and mostly numerical features.

Understand the shape of the dataset.

Observations on the data

Check the data types of the columns in the dataset.

Checking the data types of all columns

Drop CLIENTNUM Column

Since CLIENTNUM is just a unique client identifier with no relation to the other features, we can drop this column.
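A minimal sketch of the drop, using a tiny synthetic frame in place of BankChurners.csv (the values shown are illustrative):

```python
import pandas as pd

# Toy stand-in for the loaded BankChurners.csv frame (illustrative values)
data = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008],
    "Customer_Age": [45, 49],
})

# CLIENTNUM is only a unique client identifier, so it carries no signal
data = data.drop(columns=["CLIENTNUM"])
```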

Observations on the data summary

Check for Duplicate Values

Let's check for any duplicate values.

No duplicate data found; no action required.

Let's check for missing values

Let's check which columns have null values, and how many.
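The duplicate and null checks might look like the following sketch, run here on a small synthetic frame:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with a couple of deliberate gaps
data = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "Uneducated"],
    "Marital_Status": ["Married", "Single", np.nan],
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
})

dup_count = data.duplicated().sum()        # duplicate-row check
null_per_col = data.isnull().sum()         # null count per column
null_per_row = data.isnull().sum(axis=1)   # nulls per row (rows missing 1 or 2 fields)
```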

Observations on missing data

We don't want to delete rows with missing values. We have to treat these missing values so that we don't lose that data.

Checking rows to see how many values are missing

Observations on missing data by row

Checking rows with two missing values and one missing value

Let's fix all missing values in the data preprocessing step, under missing value treatment.

Let's check the different values we have for each categorical feature.

Exploratory Data Analysis and Data Processing

Univariate analysis & Bivariate analysis

Visualize all features before any data cleanup to understand what data needs cleaning and fixing.

Analysis of each feature and its relation to the target feature

Univariate analysis helps check the skewness, spread, and possible outliers of the data. Bivariate analysis helps check the relationship between two features.

Creating a method that plots univariate charts: histplot, boxplot, and bar chart percentages
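A sketch of such a helper, using plain Matplotlib (the notebook likely uses seaborn's histplot/boxplot; the structure is the same):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

def univariate_plot(series, name):
    """Histogram and boxplot side by side for one numeric feature."""
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 3))
    ax_hist.hist(series, bins=20)
    ax_hist.set_title(f"{name} histogram")
    ax_box.boxplot(series, vert=False)
    ax_box.set_title(f"{name} boxplot")
    fig.tight_layout()
    return fig

# Illustrative call on synthetic age-like data
rng = np.random.default_rng(0)
fig = univariate_plot(rng.normal(46, 8, 500), "Customer_Age")
```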

Checking for data falling outside the 3 × IQR range
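The IQR rule can be sketched as a small helper; by the usual convention, k = 1.5 flags mild outliers and k = 3 flags extreme ones:

```python
import pandas as pd

def iqr_outliers(s, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

s = pd.Series([10, 12, 11, 13, 12, 11, 200])  # 200 is an obvious outlier
mild = iqr_outliers(s, k=1.5)
extreme = iqr_outliers(s, k=3)
```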

Analysis on Customer_age

Observations on Customer_age

Analysis on Dependent_count

Observations on Dependent_count

Analysis on Months_on_book

Observations on Months_on_book

Analysis on Total_Relationship_Count

Observations on Total_Relationship_Count

Analysis on Months_Inactive_12_mon

Observations on Months_Inactive_12_mon

Analysis on Contacts_Count_12_mon

Observations on Contacts_Count_12_mon

Analysis on Credit_Limit

Observations on Credit_Limit

Let's check the outlier data for Credit_Limit.

Observations on outliers

Log transform Credit Limit to change data distribution

The log-transformed credit limit shows a better data distribution.

Let's add the log-transformed credit limit and rerun the initial analysis on that feature.
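A minimal sketch of the log transform, run on a synthetic right-skewed stand-in for Credit_Limit:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed stand-in for Credit_Limit
data = pd.DataFrame({"Credit_Limit": rng.lognormal(mean=9, sigma=0.8, size=1000)})

# log1p handles zeros safely; plain np.log also works when all values are > 0
data["Credit_Limit_Log"] = np.log1p(data["Credit_Limit"])

skew_before = data["Credit_Limit"].skew()
skew_after = data["Credit_Limit_Log"].skew()
```

The drop in skewness is what the "better data distribution" observation refers to: the transformed feature is far closer to symmetric.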

Analysis on Credit_Limit_Log

Observation: the mean and median almost match, and no outliers are shown in the box plot.

Analysis on Total_Revolving_Bal

Observations on Total_Revolving_Bal

Log transform Total Revolving Bal to change data distribution

Observations on Total_Revolving_Bal_log

Analysis on Avg_Open_To_Buy

Observations on Avg_Open_To_Buy

Log transformation on Avg_Open_To_Buy

Observations on Avg_Open_To_Buy_log

Analysis on Total_Amt_Chng_Q4_Q1

Observations on Total_Amt_Chng_Q4_Q1

Analysis on Total_Trans_Amt

Observations on Total_Trans_Amt

Log transformation on Total_Trans_Amt

Observations on Total_Trans_Amt_log

Analysis on Total_Trans_Ct

Observations on Total_Trans_Ct

Analysis on Total_Ct_Chng_Q4_Q1

Observations on Total_Ct_Chng_Q4_Q1

Analysis on Avg_Utilization_Ratio

Observations on Avg_Utilization_Ratio

Analysis on Attrition_Flag

Observations on Attrition_Flag

Analysis on Gender

Observations on Gender

Analysis on Education_Level

Observations on Education_Level

Analysis on Marital_Status

Observations on Marital_Status

Analysis on Income_Category

Observations on Income_Category

Analysis on Card_Category

Observations on Card_Category

Some features are listed as objects; we can change these to category types.

Saving memory space

Convert object columns to category types
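A sketch of the conversion, with a before/after memory comparison on synthetic data:

```python
import pandas as pd

# Synthetic low-cardinality string columns, as in the bank data
data = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"] * 250,
    "Card_Category": ["Blue", "Silver", "Blue", "Gold"] * 250,
})

mem_before = data.memory_usage(deep=True).sum()

# Low-cardinality object columns compress well as category dtype
for col in data.select_dtypes(include="object").columns:
    data[col] = data[col].astype("category")

mem_after = data.memory_usage(deep=True).sum()
```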

observations on data types

Saved 0.4 MB of space after changing all object columns to category types.

The given dataset has 6 categorical features and 14 numerical features.

Bivariate Analysis

Data correlation analysis
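A sketch of the correlation scan: compute the matrix, then list the feature pairs whose absolute correlation exceeds a threshold (0.85 here is an illustrative cutoff; the synthetic data mimics the strong Total_Trans_Ct / Total_Trans_Amt relationship):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
ct = rng.normal(65, 20, 500)              # stand-in for Total_Trans_Ct
amt = ct * 60 + rng.normal(0, 100, 500)   # amount strongly tied to count
util = rng.uniform(0, 1, 500)             # unrelated feature
data = pd.DataFrame({"Total_Trans_Ct": ct,
                     "Total_Trans_Amt": amt,
                     "Avg_Utilization_Ratio": util})

corr = data.corr(numeric_only=True)

# Keep the upper triangle only, then stack to get (feature, feature) pairs
# whose |r| exceeds the cutoff; one member of each pair is a drop candidate
high = (corr.abs()
            .where(np.triu(np.ones(corr.shape), k=1).astype(bool))
            .stack()
            .loc[lambda s: s > 0.85])
```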

observations on heatmap

Pair Plot

observations on pairplot

All features show some relation with Total_Amt_Chng_Q4_Q1, Total_Ct_Chng_Q4_Q1, Total_Trans_Ct, and Total_Trans_Amt; we have to study this more to understand it.

Checking Highly Correlated Features

Total_Trans_Amt_log vs Total_Trans_Ct

Observations on highly correlated features

Dropping highly correlated features

Dropping Contacts_Count_12_mon, since it is historical data

Checking Gender effects on Attrition

We don't see any significant effect of gender (male or female) on attrition.

Checking Education_Level effects on Attrition

Checking Marital_Status effects on Attrition

Checking Income_Category effects on Attrition

Checking Card_Category effects on Attrition

Attrition_Flag Vs Numerical Features

Observations

Attrition_Flag vs Education & Income Features

Observations

Summary Missing Value/Outlier/Feature Engineering Treatments

Model Building - Approach

Model evaluation criterion:

The model can make wrong predictions in two ways:

  1. Predicting a customer will close the card when the customer doesn't close it - loss of resources
  2. Predicting a customer will not close the card when the customer does close it - loss of opportunity

Which case is more important?

How do we reduce this loss, i.e., reduce false negatives?

  1. Data preparation
  2. Partition the data into train, validation, and test sets.
  3. Build 6 models on the train data with regular, oversampled, and undersampled data.
  4. Pick the 3 best models and hyperparameter-tune them.
  5. Pick the best model and score it against the test set.
  6. Productionize the model - implement the steps with a pipeline.

Data Preparation for Modeling

Creating Dummy Variables
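A minimal sketch with pd.get_dummies on illustrative columns; drop_first=True avoids the dummy-variable trap (perfect multicollinearity among the dummies):

```python
import pandas as pd

data = pd.DataFrame({
    "Gender": ["M", "F", "M"],
    "Marital_Status": ["Married", "Single", "Divorced"],
    "Customer_Age": [45, 49, 51],
})

# One dummy column per category, minus the first level of each feature
X = pd.get_dummies(data, columns=["Gender", "Marital_Status"], drop_first=True)
```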

Building the model

Let's start by building different models using KFold and cross_val_score and tune the best model using GridSearchCV and RandomizedSearchCV
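A sketch of the cross-validation loop on synthetic imbalanced data; recall is scored because false negatives (missed churners) are the costly error. The two models shown are illustrative stand-ins for the six in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the churn data (~16% positive class)
X, y = make_classification(n_samples=600, weights=[0.84, 0.16], random_state=1)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "dtree": DecisionTreeClassifier(max_depth=5, random_state=1),
}

# Mean recall per model across the 5 folds
scores = {name: cross_val_score(m, X, y, cv=kfold, scoring="recall").mean()
          for name, m in models.items()}
```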

Importing Libraries & Methods to track Recall/Accuracy/Confusion Matrix

Building Methods to Score/Confusion matrix

Building 6 Models with Basic tuning parameters

To avoid overfitting in the decision tree and random forest models

Running 6 models with Regular Data

Models Performance Summary - Regular data

Best Models with Regular data

Oversampling train data using SMOTE
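As a dependency-free sketch, plain random oversampling below illustrates the rebalancing idea; SMOTE itself (imblearn.over_sampling.SMOTE) goes further by synthesizing new minority points interpolated between neighbors rather than duplicating existing rows:

```python
import pandas as pd

# Tiny imbalanced stand-in for the training split (2 churners vs 8 non-churners)
train = pd.DataFrame({
    "Total_Trans_Ct": range(10),
    "Attrition_Flag": [0] * 8 + [1] * 2,
})

minority = train[train["Attrition_Flag"] == 1]
majority = train[train["Attrition_Flag"] == 0]

# Resample the minority class with replacement up to the majority size
over = minority.sample(n=len(majority), replace=True, random_state=1)
train_bal = pd.concat([majority, over], ignore_index=True)
```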

Running 6 models with Over Sampled Data

Models Performance Summary - Over Sampled Data

Best Models with Over Sampled Data

Undersampling train data using Random Under Sampler

Running 6 models with Under Sampled Data

Models Performance Summary - Under Sampled Data

Best Models with Under Sampled Data

Three Best Models out of 18 Models

All models built on oversampled and undersampled data show overfitting.

Hyperparameter Tuning

We will tune Decision Tree Classifier, Bagging Classifier and Gradient Boosting Classifier models using GridSearchCV and RandomizedSearchCV. We will also compare the performance and time taken by these two methods - grid search and randomized search.

Decision Tree - Tuning Model

Common Parameters grid for GridSearchCV & RandomizedSearchCV

Validating with GridSearchCV
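A sketch of the grid search on synthetic data; the grid shown is illustrative, not the notebook's actual one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=1)

# Small illustrative grid; GridSearchCV tries every combination
param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [5, 10]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid, scoring="recall", cv=5)
grid.fit(X, y)

# RandomizedSearchCV takes the same grid via param_distributions and
# samples n_iter combinations instead of trying all of them
```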

Validating with RandomizedSearchCV

Bagging Classifier - Tuning Model

Common Parameters grid for GridSearchCV & RandomizedSearchCV

Validating with GridSearchCV

Validating with RandomizedSearchCV

Gradient Boosting Classifier - Tuning Model

Common Parameters grid for GridSearchCV & RandomizedSearchCV

Let’s first fit a gradient boosting classifier with default parameters to get a baseline idea of the performance

The base model does not overfit, but we want to improve its performance.

Validating with GridSearchCV

Validating with RandomizedSearchCV

Model Performances

Comparing models After Tuning

Comparing Gradient Boosting with GridSearchCV and RandomizedSearchCV - both show the same numbers on all metrics.

Both GridSearchCV and RandomizedSearchCV picked the same parameter combinations.

We can pick the RandomizedSearchCV model for productionizing and the pipeline.

Checking Performance on Train/Validation & Test Data sets

Metrics with Training & Validation data set

Metrics with Training & Test data set

Feature importance - Gradient Boosting Tuned Model

Important Features

Pipelines for productionizing the model

Column Transformer

Numerical Pipeline

Categorical Pipeline

Pipelines Approach

Import Libraries

Create Pipeline
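A sketch of the pipeline, assuming illustrative column names: the numerical branch imputes with the median, and the categorical branch imputes the mode and then one-hot encodes, before feeding the tuned gradient boosting model:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

num_cols = ["Customer_Age", "Credit_Limit"]  # illustrative column lists
cat_cols = ["Gender"]

# Numerical pipeline: median imputation.
# Categorical pipeline: mode imputation, then one-hot encoding.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), num_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), cat_cols),
])

model = Pipeline([("prep", preprocess),
                  ("gbm", GradientBoostingClassifier(random_state=1))])

# Synthetic stand-in data to demonstrate fit/predict end to end
rng = np.random.default_rng(1)
df = pd.DataFrame({"Customer_Age": rng.integers(26, 70, 200).astype(float),
                   "Credit_Limit": rng.uniform(1400, 35000, 200),
                   "Gender": rng.choice(["M", "F"], 200)})
y = rng.integers(0, 2, 200)

model.fit(df, y)
preds = model.predict(df)
```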

Split & Prepare Data

Build Model

Creating new pipeline with best parameters

Pipeline Model Metrics with Training & Test data set

Observations

Business Recommendations

The model predicts customer attrition with 95% or better performance. The bank should focus on customers flagged as likely to attrite: contact them, understand their concerns, and provide a workaround. This would help retain those customer accounts.